For two decades, the Colless index has been the most frequently
used statistic for assessing the balance of phylogenetic trees. In this
article, this statistic is studied under the Yule and uniform models of
phylogenetic trees. The main tool of analysis is a coupling argument
with another well-known index called the Sackin statistic. Asymptotics
for the mean, variance and covariance of these two statistics
are obtained, as well as their limiting joint distribution for large
phylogenies. Under the Yule model, the limiting distribution arises as
a solution of a functional fixed-point equation. Under the uniform
model, the limiting distribution is the Airy distribution. The
cornerstone of this study is the fact that the probabilistic models for
phylogenetic trees are strongly related to the random permutation
and the Catalan models for binary search trees.

1. Introduction. Phylogenetic trees (PT) represent the shared history of
extant species. The idea of using trees to model evolution dates back to
Darwin ([10], see his diagram on page 117). In a (rooted) PT, there is a
common ancestral species called the root and each branching represents the
time at which a divergence has occurred. A PT is usually reconstructed
using data from $n$ different species (or taxa) which are located at the leaves.
The tree has $n - 1$ internal nodes that correspond to the ancestors of the
sample. There are two distinct features of rooted PTs. The first is the branching
structure or topology of the tree. The second is the branch lengths, which
correspond to periods of time separating major evolutionary events. The shape of
such trees carries useful information about the history of diversification rates
among species by reflecting the footprint left by evolutionary processes.

Biologists have extensively investigated the ways in which the shapes of
PTs can be measured [20]. Mooers and Heard [25] wrote an exhaustive
review concerning tree balance in systematic biology and Aldous [5] gave
an introduction in a more mathematical setting. How these measures are
related to macroevolutionary processes has been studied by Rogers [31] and
Agapow and Purvis [1], relying upon intensive computer simulations. Thus
far, several statistics have been introduced to measure the shapes of PTs
(see Agapow and Purvis [1], for eight of them). Among these statistics, the
most widespread are Sackin's and Colless' indices. Sackin's index [34, 36]
counts the number of ancestors crossed in the path from each leaf to the root.
Colless' index [9] looks at the internal nodes, partitioning the leaves that
descend from them into groups of sizes $L$ and $R$, and computes the sum of
absolute values $|L - R|$ for all ancestors.

The probability distributions of the Sackin and the Colless statistics have
been investigated for various models of biologically plausible random trees.
Two random models of PT are often considered in the literature. The most
famous is the Yule model [38]. The Yule model is a branching process with
constant speciation rate where the number of extant species is specified. The
assumption of a constant speciation rate may be weakened by assuming that
the diversification rate could vary in time but is the same for all species at
any time. This assumption does not modify the distribution of PT shape.
An alternative model considered by biologists is called the uniform model. It
assumes that all PTs are equally likely. This model is biologically motivated,
as it arises from a large family of Galton-Watson processes conditioned on
the total size of the trees (see [2]). In addition, McKenzie and Steel [24] have
shown that when speciation events are constrained to occur before a time r
after their previous speciation event, the resulting process converges to the
uniform model as r tends to zero. In both models, Rogers [31] studied the
joint distribution of the Sackin and the Colless statistics using numerical
computations. He concluded that these statistics were strongly correlated in
large PTs. The limiting distribution of the Colless statistic was also
conjectured to be non-Gaussian [30].

This article describes the mean, the (co)variance and the limiting joint
distribution of Sackin's and Colless' indices for large PTs under the Yule and
uniform models. Because this study is mainly concerned with the topology
of PTs, branch lengths can be ignored. A PT is then a cycle-free connected
graph with vertices of degree one (the leaves), two (the root) or three (all
ancestors except the root). Leaves are usually labeled, whereas ancestors are
not. This simplified model of phylogeny without branch lengths is sometimes
called a cladogram (see [4]). Our proofs use the connection to recent results
in theoretical computer science, as well as the correspondence between PTs
and random binary search trees (BST). This approach extends results by
Blum and François [6], who showed that Sackin's index has the same limit
distribution as the number of comparisons used by the quicksort algorithm
[16]. More specifically, we deal with the Yule and uniform models separately.
For the Yule model, our analysis relies on the recursive structure of the tree
and makes use of the fixed-point method (see, e.g., [17]). This method was
introduced in the probabilistic analysis of algorithms by Rösler [32]. In the
uniform model, the results are based on the connection between uniform
trees and Bernoulli excursions [37]. A large family of statistics similar to
Sackin's and Colless' indices has been studied by Fill and Kapur [12] under
the Catalan model for BSTs.

In Section 2, we shall present our main results. Section 3 explains how 
probabilistic models for PTs are related to probabilistic models for BSTs. 
Section 4 is dedicated to the Yule model, while Section 5 deals with the 
uniform model. 

2. The Sackin and the Colless statistics. Consider a PT with $n$ leaves.
The Sackin statistic adds the number of internal nodes between each leaf
and the root of the tree to produce the index

$$S_n = \sum_{i=1}^{n} d_i,$$

where the sum runs over the $n$ leaves of the tree and $d_i$ is the number of
ancestors crossed in the path from leaf $i$ to the root (including the root). The
Colless statistic looks at the internal nodes, partitioning the leaves that
descend from them into groups of sizes $L_j$ and $R_j$ and computing

$$C_n = \sum_{j=1}^{n-1} |L_j - R_j|,$$

where the sum runs over the internal nodes and $L_j$ (resp. $R_j$) corresponds
to the number of leaves in the left (resp. right) subtree under node $j$.
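As an illustration of the two definitions, the following minimal sketch computes both indices on small trees. The encoding of a PT as nested 2-tuples (anything that is not a tuple is a leaf) and the function names are assumptions made for this example only; they do not come from the article.

```python
def leaf_count(tree):
    """Number of leaves of a rooted binary tree encoded as nested 2-tuples."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return leaf_count(left) + leaf_count(right)


def sackin(tree, depth=0):
    """Sum over the leaves of the number of ancestors on the path to the root."""
    if not isinstance(tree, tuple):
        return depth
    left, right = tree
    return sackin(left, depth + 1) + sackin(right, depth + 1)


def colless(tree):
    """Sum over the internal nodes of |L_j - R_j|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    imbalance = abs(leaf_count(left) - leaf_count(right))
    return imbalance + colless(left) + colless(right)


# Two hypothetical example trees with four leaves each.
caterpillar = ((("a", "b"), "c"), "d")   # maximally unbalanced
balanced = (("a", "b"), ("c", "d"))      # perfectly balanced

print(sackin(caterpillar), colless(caterpillar))  # 9 3
print(sackin(balanced), colless(balanced))        # 8 0
```

Both indices are smallest on the balanced tree and largest on the caterpillar, which is why they are used as measures of imbalance.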

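Denote by $\mathcal{M}_2$ the space of all bivariate, centered probability measures
with finite second moments, and by $\mathcal{L}(X)$ the law of a random vector $X$,
viewed as an element of $\mathcal{M}_2$. We have the following result:

Theorem 1. Assume the Yule model of PT. Consider the map
$T : \mathcal{M}_2 \to \mathcal{M}_2$ such that for all $\nu \in \mathcal{M}_2$, we have

$$T(\nu) = \mathcal{L}\left( \begin{pmatrix} U & 0 \\ 0 & U \end{pmatrix} \begin{pmatrix} S \\ C \end{pmatrix} + \begin{pmatrix} 1-U & 0 \\ 0 & 1-U \end{pmatrix} \begin{pmatrix} S' \\ C' \end{pmatrix} + \begin{pmatrix} b_S \\ b_C \end{pmatrix} \right)$$

with

$$\begin{pmatrix} b_S \\ b_C \end{pmatrix} = \begin{pmatrix} 2U\log U + 2(1-U)\log(1-U) + 1 \\ U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U) \end{pmatrix},$$

where $(S,C)$, $(S',C')$ and $U$ are independent random variables such that
$\mathcal{L}(S,C) = \mathcal{L}(S',C') = \nu$ and $U$ is uniform over the interval $(0,1)$. Then we
have

$$\left( \frac{S_n - \mathbb{E}[S_n]}{n}, \frac{C_n - \mathbb{E}[C_n]}{n} \right) \xrightarrow{d} (S, C),$$

where the convergence holds in distribution and the limiting probability
distribution is the unique fixed point of the map $T$.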



Remark 1. The convergence in Theorem 1 will actually be proven for
a stronger topology than convergence in distribution. As can be seen from
Section 4, it indeed holds for the Wasserstein-Mallows $d_2$-metric [29], which
guarantees the existence and convergence of the second moments.



Remark 2. This result extends the fact that the normalized Sackin
index

$$\frac{S_n - \mathbb{E}[S_n]}{n} \tag{1}$$

converges in distribution to the same limit as the number of comparisons
in the quicksort algorithm. According to Rösler [32], the limit $S$ satisfies a
(functional) fixed-point equation of the type

$$S = US + (1-U)S' + 2U\log U + 2(1-U)\log(1-U) + 1, \tag{2}$$

where $S$, $S'$ and $U$ are independent random variables, $S$ and $S'$ are
identically distributed, $U$ is uniformly distributed over the interval $(0,1)$ and
the identity holds for distributions. Regarding Colless' index, the functional
fixed-point equation becomes

$$C = UC + (1-U)C' + U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U). \tag{3}$$



A well-known result in systematic biology is that the expectation of $S_n$ is
of order $2n\log n$. More precisely, Kirkpatrick and Slatkin [20] showed that

$$\mathbb{E}[S_n] = 2n \sum_{j=2}^{n} \frac{1}{j}$$

and

$$\mathbb{E}[S_n] = 2n\log n + (2\gamma - 2)n + o(n),$$

where $\gamma$ is Euler's constant. Using the connection to the quicksort algorithm,
the variance of the limiting distribution can be obtained according to Knuth
[21] as

$$\mathrm{Var}[S_n] \sim \left( 7 - \frac{2\pi^2}{3} \right) n^2, \qquad n \to \infty.$$

These results can be extended to the case of Colless' index as follows, taking
into account that $S_n$ and $C_n$ are strongly correlated for large PTs:

Theorem 2. Assume the Yule model of PT. Then we have

$$\mathbb{E}[C_n] = n\log n + (\gamma - 1 - \log 2)n + o(n), \tag{4}$$

$$\mathrm{Var}[C_n] \sim \left( 3 - \frac{\pi^2}{6} - \log 2 \right) n^2, \tag{5}$$

$$\mathrm{Cor}[S_n, C_n] \sim \frac{27 - 2\pi^2 - 6\log 2}{\sqrt{2(18 - \pi^2 - 6\log 2)(21 - 2\pi^2)}} \approx 0.98, \tag{6}$$

as $n$ goes to infinity.
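The constants of Theorem 2 are easy to evaluate numerically. The sketch below recomputes the limiting correlation from the limiting variances $7 - 2\pi^2/3$ and $3 - \pi^2/6 - \log 2$ and from the limiting covariance $9/2 - \pi^2/3 - \log 2$ obtained in the proof of Theorem 2 (Section 4), and compares it with the closed form (6). The variable names are illustrative only.

```python
import math

# Limiting constants (coefficients of n^2) under the Yule model.
var_s = 7 - 2 * math.pi ** 2 / 3                 # Sackin, from Knuth [21]
var_c = 3 - math.pi ** 2 / 6 - math.log(2)       # Colless, equation (5)
cov_sc = 4.5 - math.pi ** 2 / 3 - math.log(2)    # covariance, proof of Theorem 2

# Correlation computed from its definition.
cor = cov_sc / math.sqrt(var_s * var_c)

# Correlation computed from the closed form (6).
cor_closed = (27 - 2 * math.pi ** 2 - 6 * math.log(2)) / math.sqrt(
    2 * (18 - math.pi ** 2 - 6 * math.log(2)) * (21 - 2 * math.pi ** 2)
)

print(round(cor, 4), round(cor_closed, 4))
```

The two values agree and are close to 0.98, which quantifies the strong correlation between the two indices for large Yule trees.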

Regarding the uniform model of PT, mathematical results have received
less attention than for the Yule model. After an appropriate rescaling, we
prove the convergence of both $S_n$ and $C_n$ to the same marginal probability
distribution and identify this distribution as $\sqrt{8}$ times the integral of the
standard Brownian excursion $e(t)$,

$$\omega = \int_0^1 e(t)\,dt.$$

The distribution of the random variable $A = \sqrt{8}\,\omega$ is known as the Airy
distribution. A formula for the moments of $A$ has been given by Flajolet and
Louchard [13]. In particular, we have

$$\mathbb{E}[A] = \sqrt{\pi}$$

and

$$\mathrm{Var}[A] = \frac{10 - 3\pi}{3}.$$

Theorem 3. Assume the uniform model of PT. Then we have

$$\frac{C_n - S_n}{n^{3/2}} \xrightarrow{P} 0 \tag{7}$$

and

$$\frac{C_n}{n^{3/2}} \xrightarrow{d} A$$

as $n$ goes to infinity.
Remark 3. Regarding $S_n$, the connection to the internal path length
of a BST enables us to immediately state that

$$\frac{S_n}{n^{3/2}} \xrightarrow{d} A.$$

This result was actually established by Takács [37] using the method of
moments. In addition, we find that

$$\mathbb{E}[S_n] \sim \sqrt{\pi}\, n^{3/2}$$

and

$$\mathrm{Var}[S_n] \sim \frac{10 - 3\pi}{3}\, n^3.$$

The moments of $C_n$ follow from the next theorem.

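Theorem 4. Assume the uniform model of PT. Then we have

$$\mathbb{E}[C_n] \sim \sqrt{\pi}\, n^{3/2}$$

and

$$\mathrm{Var}[C_n] \sim \frac{10 - 3\pi}{3}\, n^3$$

as $n$ goes to infinity. In addition, the variables $S_n$ and $C_n$ are asymptotically
correlated, that is,

$$\mathrm{Cor}[S_n, C_n] \sim 1,$$

and we have, for any $k, \ell \ge 0$,

$$\mathbb{E}[S_n^k C_n^\ell] \sim n^{3(k+\ell)/2}\, \mathbb{E}[A^{k+\ell}]$$

as $n$ goes to infinity.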



Remark 4. While $C_n$ and $S_n$ are by far the most popular statistics
used in studies of phylogenetic imbalance, other measures have also been
considered (see [1]). Some of these can also be studied in the Yule model
using the contraction method, mainly because they are defined as sums of
elementary functions of subtrees over all nodes. For example, the result for
the Fusco and Cronk statistic, modified by Purvis, Katzourakis and Agapow
[28], is left to the reader. In the same spirit, we believe that the $B_1$ index
of Shao and Sokal [36] could be studied without difficulties. Studying the
remaining statistics ($B_2$ and $\sigma_N^2$) would nevertheless require considerably
more effort.
Remark 5. In a recent large-scale study of the phylogenetic database,
Blum and François [7] considered the shape statistic

$$F_n = \sum_{j=1}^{n-1} \log(N_j - 1),$$

where the sum runs over all internal nodes and $N_j$ represents the number of
extant descendants of internal node $j$. A similar statistic had been previously
proposed by Chan and Moore [8], but the logarithm was omitted. Once the
normalizing constant has been removed, $F_n$ corresponds to the logarithm
of the probability of a tree in the Yule model. In particular, the statistical
test based on $F_n$ is the most powerful test for rejecting the Yule model
against the uniform model and conversely (Neyman-Pearson theorem). Fill
[11] showed that $F_n$ has a Gaussian distribution (for large trees) and gave
asymptotic expansions for the means and variances under both the Yule and
the uniform models (see also [12]).

3. Phylogenetic and binary search trees. Trees are often encountered in
theoretical computer science as data structures associated with
divide-and-conquer algorithms. In this section, we explain how binary search trees can
be mapped onto phylogenetic trees unambiguously and how probabilistic models
for BSTs are transported to probabilistic models for PTs.

Mapping binary search trees. A binary tree can be defined recursively. It is
either empty or it is a node (the root) with left and right subtrees. A binary
search tree is a binary tree where labels are associated with the vertices.
These labels are constrained: the label of a vertex is greater than or equal to
all labels contained in the left subtree and less than or equal to all labels
contained in the right subtree (Section 5.5, [35]). The transformation that
maps BSTs into PTs can be found in [4]. Given a BST with $n - 1$ vertices,
the structure is modified as follows. Vertices in the BST become ancestors
in the PT. To accomplish this, two leaves are connected to each vertex of
degree one and one leaf is connected to each vertex of degree two. The root
has a special status. If its degree is 0, 1 or 2, then 2, 1 or 0 leaves are added,
respectively. The labels of leaves are chosen arbitrarily from the $n!$ possible orders.
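The extension step can be sketched in a few lines. In the sketch below, a BST shape is encoded as `None` (empty tree) or a pair `(left, right)`, and every empty subtree is replaced by a new leaf, which attaches two leaves to a vertex without children and one leaf to a vertex with a single child. The encoding, the function name `extend` and the left-to-right integer labeling are assumptions of this example; the article allows any of the $n!$ labelings.

```python
from itertools import count


def extend(bst, labels=None):
    """Extend a BST shape with m vertices into a PT shape with m + 1 leaves.

    bst is None (empty) or a pair (left, right). Leaves of the result are
    labeled 1, 2, ... from left to right (an arbitrary choice).
    """
    if labels is None:
        labels = count(1)
    if bst is None:
        return next(labels)          # an empty subtree becomes a leaf
    left, right = bst
    # Tuple elements are evaluated left to right, so labels follow that order.
    return (extend(left, labels), extend(right, labels))


# A BST with two vertices: a root whose only child is on the left.
two_vertices = ((None, None), None)
print(extend(two_vertices))          # ((1, 2), 3)
```

A BST with $n - 1$ vertices always has $n$ empty subtrees, so the result has $n$ leaves and $n - 1$ internal nodes, as required for a PT.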

Two obtained PTs are equivalent if their left and right subtrees can be
interchanged recursively (see Figure 1). The set of PTs is the set of equivalence
classes for this equivalence relation. Figure 2 gives a graphical representation
of these transformations.

A PT may therefore arise from the construction of $2^{n-1}$ equivalent
modified BSTs. Because there are $C_{n-1}$ BSTs with $n - 1$ vertices (where
$C_{n-1}$ denotes a Catalan number), we obtain $C_{n-1}\, n!/2^{n-1}$ possible PTs.
This number coincides with the total number of PTs, which equals $(2n-3)!!$, where

$$(2n - 3)!! = (2n - 3)(2n - 5) \cdots 3 \cdot 1.$$
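The counting identity $C_{n-1}\, n!/2^{n-1} = (2n-3)!!$ can be checked directly for small $n$. The helper names below are illustrative, and the convention $(-1)!! = 1$ is assumed for the case $n = 1$.

```python
from math import comb, factorial


def catalan(m):
    """m-th Catalan number, the number of binary trees with m vertices."""
    return comb(2 * m, m) // (m + 1)


def odd_double_factorial(m):
    """m!! for odd m >= -1, with the convention (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_pts(n):
    """Number of PTs with n labeled leaves, counted through BSTs."""
    return catalan(n - 1) * factorial(n) // 2 ** (n - 1)


for n in range(1, 8):
    print(n, count_pts(n), odd_double_factorial(2 * n - 3))
```

For instance, both expressions give 3 PTs on three leaves and 15 PTs on four leaves.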
Probabilistic models. The mapping described in the above paragraph also
transfers probabilistic models for BSTs to probabilistic models for PTs.
For instance, there will be an equivalence between the random permutation
model for BSTs [22] and the Yule model for PTs. Probabilistic models
for BSTs (with $n - 1$ vertices) can be described as a general class of models
called branching Markov processes. A definition of branching Markov
processes can be found in [4]. We recall this definition here. Let $\tilde q_{n-1}$ be a
symmetric probability distribution on $\{0, \dots, n-2\}$:

$$\tilde q_{n-1}(i) = \tilde q_{n-1}(n-2-i), \qquad i = 0, \dots, n-2.$$

In the branching Markov process, the size of the left subtree is chosen
according to the probability distribution $\tilde q_{n-1}$. This procedure is repeated
recursively in subtrees, assuming local independence. The probability distribution
$\tilde q_{n-1}$ is called the splitting distribution. In the same way, probability
distributions on PTs with $n$ leaves can be associated with splitting probability
distributions $q_n$ on $\{1, \dots, n-1\}$. At each step, the labels of the left
subtree, of size $i$, are sampled uniformly from the $\binom{n}{i}$ possible sets of labels. At the
end of the construction, left-right distinctions are simply suppressed in the
building of the PT.

Fig. 1. Two graphical representations of the same PT. They are seen to be identical by
interchanging left and right subtrees.

Fig. 2. The transformation of two binary search trees to the same PT. The extension
consists of connecting two leaves to vertices with outdegree 0 and one leaf to vertices with
outdegree 1. The two resulting trees represent the same PT.

Lemma 1. Assume that $\tilde T_{n-1}$ is a BST, sampled according to a branching
Markov process with splitting probability $\tilde q_{n-1}$. Denote by $\Phi_n$ the
transformation which consists of extending a BST with $n - 1$ vertices to a PT with
$n$ leaves. Then $T_n = \Phi_n(\tilde T_{n-1})$ is a PT, sampled according to a branching
Markov process with splitting probability $q_n$ such that

$$q_n(i) = \tilde q_{n-1}(i-1), \qquad i = 1, \dots, n-1.$$

Proof. This is a consequence of the basic properties of $\Phi_n$. If a BST has
$i$ vertices in its left subtree, the resulting PT has $i + 1$ leaves in one of the two
subtrees of the root. The symmetry property of $q_n$ ensures that all members
of the same equivalence class have the same probability of occurrence. □

Lemma 1 has the interesting consequence that well-studied models of
BSTs can be transposed into models on PTs. The Yule and uniform models
for PTs then get associated with special cases of branching Markov
processes.

On one hand, the random permutation model for BSTs with $n - 1$ vertices
is a branching Markov process with splitting probability

$$\tilde q_{n-1}(i) = \frac{1}{n-1}, \qquad i = 0, \dots, n-2.$$

This model is mapped by $\Phi_n$ into the Yule model for PTs with $n$ leaves with
splitting probability

$$q_n(i) = \frac{1}{n-1}, \qquad i = 1, \dots, n-1.$$

The splitting distribution for Yule trees was found by Harding [15]. Note 
that the same splitting property also holds for n-coalescent tree topologies 
[19]. 
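The uniform splitting probability makes exact computations under the Yule model straightforward. As a sketch, note that the definition of Sackin's index gives $S_n = S_{I} + S'_{n-I} + n$ when the root splits the $n$ leaves into groups of sizes $I$ and $n - I$ (every leaf gains the root as an ancestor); this recursion is derived here for illustration and is not quoted from the article. Taking expectations with $I$ uniform on $\{1, \dots, n-1\}$ yields a recurrence whose solution can be compared with the formula $2n\sum_{j=2}^{n} 1/j$ of Kirkpatrick and Slatkin [20].

```python
from fractions import Fraction


def expected_sackin_yule(n_max):
    """Exact E[S_n] for n = 1..n_max under the Yule model.

    Uses e_n = n + (2 / (n - 1)) * (e_1 + ... + e_{n-1}) with e_1 = 0,
    which follows from the uniform splitting distribution on {1, ..., n-1}.
    """
    e = [Fraction(0)] * (n_max + 1)      # e[0] is unused, e[1] = 0
    prefix = Fraction(0)                 # running sum e[1] + ... + e[n-1]
    for n in range(2, n_max + 1):
        prefix += e[n - 1]
        e[n] = n + Fraction(2, n - 1) * prefix
    return e


def harmonic_formula(n):
    """Closed form 2n * sum_{j=2}^{n} 1/j."""
    return 2 * n * sum(Fraction(1, j) for j in range(2, n + 1))


e = expected_sackin_yule(10)
print(e[4], harmonic_formula(4))         # 26/3 26/3
```

Exact rational arithmetic is used so that the recurrence and the closed form can be compared without rounding error.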

On the other hand, the Catalan model for BSTs with $n - 1$ vertices
assumes that all $C_{n-1}$ binary trees have the same probability of occurrence.
The number of trees with a left subtree of size $i$ is equal to $C_i C_{n-2-i}$. The
Catalan model is a branching Markov process where the splitting
distribution is given by

$$\tilde q_{n-1}(i) = \frac{C_i\, C_{n-2-i}}{C_{n-1}}, \qquad i = 0, \dots, n-2.$$

The transformation $\Phi_n$ maps the Catalan model into the uniform model for
PTs with $n$ leaves. The splitting distribution for PTs is then

$$q_n(i) = \frac{1}{2} \binom{n}{i} \frac{(2i-3)!!\,(2(n-i)-3)!!}{(2n-3)!!}, \qquad i = 1, \dots, n-1.$$

This formula can also be found in [4].
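The two descriptions of the uniform model should agree through Lemma 1, that is, the splitting distribution for PTs written with double factorials should equal the shifted Catalan splitting distribution $C_{i-1} C_{n-1-i}/C_{n-1}$. The sketch below checks this with exact arithmetic; the function names are illustrative, and the convention $(-1)!! = 1$ is assumed.

```python
from fractions import Fraction
from math import comb


def catalan(m):
    """m-th Catalan number."""
    return comb(2 * m, m) // (m + 1)


def odd_double_factorial(m):
    """m!! for odd m >= -1, with the convention (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def q_uniform(n, i):
    """Splitting probability of the uniform model for PTs with n leaves."""
    numerator = (comb(n, i) * odd_double_factorial(2 * i - 3)
                 * odd_double_factorial(2 * (n - i) - 3))
    return Fraction(numerator, 2 * odd_double_factorial(2 * n - 3))


def q_catalan_shifted(n, i):
    """Catalan splitting probability for BSTs with n - 1 vertices, shifted by one."""
    return Fraction(catalan(i - 1) * catalan(n - 1 - i), catalan(n - 1))


n = 6
print([str(q_uniform(n, i)) for i in range(1, n)])
print([str(q_catalan_shifted(n, i)) for i in range(1, n)])
```

Both lists coincide and sum to one, which illustrates that the uniform model on PTs is the image of the Catalan model on BSTs.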



Lemma 2. Let $q_n$ be a splitting distribution on $\{1, \dots, n-1\}$. Let $h$ be
a function of pairs of integers. Denote by $X_n$ an additive random variable,
defined recursively as

$$X_n = X_{I_n} + X_{J_n} + h(I_n, J_n),$$

where the $I_n$'s are sampled under a branching Markov process of splitting
distribution $q_n$ and where $I_n + J_n = n$. Define $\tilde X_n$ by

$$\tilde X_n = \tilde X_{\tilde I_n} + \tilde X_{\tilde J_n} + h(\tilde I_n + 1, \tilde J_n + 1),$$

where the $\tilde I_n$'s are sampled under the branching Markov process of splitting
distribution $\tilde q_n$ with

$$\tilde q_n(i) = q_{n+1}(i+1), \qquad i = 0, \dots, n-1,$$

and where $\tilde I_n + \tilde J_n = n - 1$. Then we have

$$X_n \stackrel{d}{=} \tilde X_{n-1}.$$

Proof. Note that $\tilde I_n = I_{n+1} - 1$, that is, the distribution of $\tilde I_n$ is given
by $\tilde q_n$. Similarly, we have

$$X_n = X_{\tilde I_{n-1}+1} + X_{\tilde J_{n-1}+1} + h(\tilde I_{n-1} + 1, \tilde J_{n-1} + 1).$$

Setting $\tilde X_{n-1} = X_n$, we prove the result. □



Remark 6. This lemma states that for additive random variables $X_n$ built
from a Markov branching PT of splitting distribution $q_n$, there exist additive
random variables $\tilde X_n$ built from a Markov branching BST of distribution $\tilde q_n$.
In addition, $X_n$ and $\tilde X_{n-1}$ have the same distribution. This lemma can
obviously be generalized to multivariate random variables. In the next sections,
all random variables are studied in the context of BSTs. Applying Lemma 2,
the results can be transposed to PTs without difficulties.
4. Yule model. The Yule model is a branching Markov process for PTs
with splitting probability

$$q_n(i) = \frac{1}{n-1} \qquad \text{for } i = 1, \dots, n-1.$$

Sackin's index $S_n$ has been defined as a sum of depths over the leaves.
It can also be expressed as a sum over the internal nodes [31],

$$S_n = \sum_{j=1}^{n-1} N_j,$$

where $N_j$ is the number of leaves descending from the internal node $j$.
Applying Lemma 2, we obtain that $S_n$ has the same distribution as $\tilde S_{n-1} + 2(n-1)$,
where $\tilde S_n$ is defined by

$$\tilde S_n = \tilde S_{I_n} + \tilde S'_{J_n} + n - 1, \tag{8}$$

$I_n$ is distributed uniformly over $\{0, \dots, n-1\}$ and $J_n = n - 1 - I_n$. The
recursion satisfied by $\tilde S_n$ is well studied since it arises from the analysis
of the quicksort algorithm or the internal path length of a BST under the
random permutation model. Similarly, $C_n$ has the same distribution as $\tilde C_{n-1}$,
where

$$\tilde C_n = \tilde C_{I_n} + \tilde C'_{J_n} + |I_n - J_n|. \tag{9}$$

In order to describe the joint distribution of $(\tilde S_n, \tilde C_n)$ under the random
permutation model, we shall follow the same lines of proof as Neininger [27],
who studied the joint convergence of the Wiener index and the internal path
length of a BST.

Proof of Theorem 1. Step 1. Computing expectations. Denote
$c_n = \mathbb{E}[\tilde C_n]$ and $s_n = \mathbb{E}[\tilde S_n]$. We have

$$s_n = 2n\log n + (2\gamma - 4)n + o(n). \tag{10}$$

We rewrite equation (9) as

$$\tilde C_n = \tilde C_{I_n} + \tilde C'_{J_n} + n - 1 - 2\min(I_n, J_n). \tag{11}$$

Conditioning on $I_n$ in the above equation, we find that

$$c_n = (n - 1 - 2t_n) + \frac{2}{n} \sum_{k=0}^{n-1} c_k,$$

where

$$t_n = \mathbb{E}[\min(I_n, J_n)] = \begin{cases} \dfrac{n-2}{4}, & \text{if } n \text{ is even}, \\[2mm] \dfrac{(n-1)^2}{4n}, & \text{if } n \text{ is odd}. \end{cases}$$

Applying Lemma 1 of [17], page 1691, we obtain that

$$c_n = (n - 1 - 2t_n) + 2(n+1) \sum_{k=1}^{n-1} \frac{k - 1 - 2t_k}{(k+1)(k+2)}.$$

An asymptotic expansion of the above expression leads to the following
result:

$$c_n = n\log n + (\gamma - 1 - \log 2)n + o(n). \tag{12}$$
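The computations of Step 1 can be checked numerically. The sketch below verifies the closed form of $t_n = \mathbb{E}[\min(I_n, J_n)]$ (with $I_n$ uniform on $\{0, \dots, n-1\}$ and $J_n = n - 1 - I_n$) against a direct enumeration, solves the recurrence $c_n = (n - 1 - 2t_n) + \frac{2}{n}\sum_{k<n} c_k$ with $c_0 = 0$ exactly, compares the result with the explicit solution $b_n + 2(n+1)\sum_{k=1}^{n-1} b_k/((k+1)(k+2))$ where $b_k = k - 1 - 2t_k$, and finally compares $c_n$ with the expansion (12). The function names are illustrative.

```python
import math
from fractions import Fraction


def t_brute(n):
    """E[min(I, n-1-I)] for I uniform on {0, ..., n-1}, by enumeration."""
    return Fraction(sum(min(i, n - 1 - i) for i in range(n)), n)


def t_closed(n):
    """Closed form: (n-2)/4 for even n, (n-1)^2/(4n) for odd n."""
    return Fraction(n - 2, 4) if n % 2 == 0 else Fraction((n - 1) ** 2, 4 * n)


def toll(n):
    """Expected toll b_n = n - 1 - 2 t_n of the Colless recursion."""
    return n - 1 - 2 * t_closed(n)


def colless_means(n_max):
    """Exact c_0, ..., c_{n_max} from c_n = b_n + (2/n) * sum_{k<n} c_k."""
    c = [Fraction(0)]
    running = Fraction(0)
    for n in range(1, n_max + 1):
        running += c[n - 1]
        c.append(toll(n) + Fraction(2, n) * running)
    return c


def explicit_solution(n):
    """Explicit solution b_n + 2(n+1) * sum_{k=1}^{n-1} b_k / ((k+1)(k+2))."""
    tail = sum(toll(k) / ((k + 1) * (k + 2)) for k in range(1, n))
    return toll(n) + 2 * (n + 1) * tail


c = colless_means(400)
n = 400
gamma = 0.5772156649015329   # Euler's constant
approx = n * math.log(n) + (gamma - 1 - math.log(2)) * n
print(float(c[n]), approx)
```

The difference between the exact value and the two-term expansion grows much more slowly than $n$, in line with the $o(n)$ remainder in (12).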

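Step 2. Limit distribution. Let us consider the rescaled quantities

$$X_n = \begin{pmatrix} (\tilde S_n - s_n)/n \\ (\tilde C_n - c_n)/n \end{pmatrix}$$

and $X'_n$, an independent copy of $X_n$. From equations (8) and (11), we have

$$X_n \stackrel{d}{=} A_1^{(n)} X_{I_n} + A_2^{(n)} X'_{J_n} + b^{(n)}, \tag{13}$$

where

$$A_1^{(n)} = \frac{1}{n} \begin{pmatrix} I_n & 0 \\ 0 & I_n \end{pmatrix}, \qquad A_2^{(n)} = \frac{1}{n} \begin{pmatrix} J_n & 0 \\ 0 & J_n \end{pmatrix}$$

and

$$b^{(n)} = \begin{pmatrix} b_S^{(n)} \\ b_C^{(n)} \end{pmatrix} = \frac{1}{n} \begin{pmatrix} s_{I_n} + s_{J_n} - s_n + n - 1 \\ c_{I_n} + c_{J_n} - c_n + n - 1 - 2\min(I_n, J_n) \end{pmatrix}.$$

Since $I_n/n$ converges in $L_2$ toward $U$, a uniform variable over $(0,1)$, we have

$$A_1^{(n)} \xrightarrow{L_2} A_1^* = \begin{pmatrix} U & 0 \\ 0 & U \end{pmatrix}$$

and

$$A_2^{(n)} \xrightarrow{L_2} A_2^* = \begin{pmatrix} 1-U & 0 \\ 0 & 1-U \end{pmatrix}.$$

Using the asymptotic expansion of $c_n$ given by (12) and the asymptotic
expansion of $s_n$ given by (10), we find that

$$b_S^{(n)} = \frac{1}{n} \left( 2I_n \log\frac{I_n}{n} + 2J_n \log\frac{J_n}{n} + n \right) + o(1)$$

and

$$b_C^{(n)} = \frac{1}{n} \left( I_n \log\frac{I_n}{n} + J_n \log\frac{J_n}{n} + n - 2\min(I_n, J_n) \right) + o(1).$$

Thus, we have

$$b^{(n)} \xrightarrow{L_2} b^* = \begin{pmatrix} 2U\log U + 2(1-U)\log(1-U) + 1 \\ U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U) \end{pmatrix}.$$

Assuming that $X_n$ converges in distribution, the limiting distribution $\mathcal{L}(X)$
must satisfy the condition

$$X \stackrel{d}{=} A_1^* X + A_2^* X' + b^*, \tag{14}$$

where $X$, $X'$ and $(A_1^*, A_2^*, b^*)$ are independent and $X' \stackrel{d}{=} X$.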

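The multivariate contraction theorem [26] states that there is a unique
probability distribution $\mathcal{L}(X)$ satisfying (14) in $\mathcal{M}_2$. Moreover, it states
that the distribution of $X_n$ converges toward the distribution of $X$ in the
Wasserstein-Mallows $d_2$-metric. The convergence in this metric is the same
as the convergence in distribution and the convergence of second moments.
Neininger's theorem can be applied provided that the following four
conditions hold:

(i) $(A_1^{(n)}, A_2^{(n)}, b^{(n)}) \xrightarrow{L_2} (A_1^*, A_2^*, b^*)$, $n \to \infty$,

(ii) $\mathbb{E}[\|(A_1^*)^\top A_1^*\|_{\mathrm{op}}] + \mathbb{E}[\|(A_2^*)^\top A_2^*\|_{\mathrm{op}}] < 1$,

(iii) $\mathbb{E}[\mathbf{1}_{\{I_n \le \ell\} \cup \{I_n = n\}} \|(A_1^{(n)})^\top A_1^{(n)}\|_{\mathrm{op}}] \to 0$, for all $\ell \in \mathbb{N}$, $n \to \infty$,

(iv) $\mathbb{E}[\mathbf{1}_{\{J_n \le \ell\} \cup \{J_n = n\}} \|(A_2^{(n)})^\top A_2^{(n)}\|_{\mathrm{op}}] \to 0$, for all $\ell \in \mathbb{N}$, $n \to \infty$,

where $\|A\|_{\mathrm{op}} = \sup_{\|x\|=1} \|Ax\|$ is the operator norm of $A$. For symmetric
matrices (which we are considering here) this equals the spectral radius.
(i) has already been proved. (ii) holds because

$$\mathbb{E}[\|(A_1^*)^\top A_1^*\|_{\mathrm{op}}] + \mathbb{E}[\|(A_2^*)^\top A_2^*\|_{\mathrm{op}}] = \mathbb{E}[U^2 + (1-U)^2] = \frac{2}{3} < 1.$$

(iii) and (iv) are obvious because $\|(A_r^{(n)})^\top A_r^{(n)}\|_{\mathrm{op}} \le 1$ for $r = 1, 2$ and

$$\mathbb{P}(\{I_n \le \ell\} \cup \{I_n = n\}) = \mathbb{P}(\{J_n \le \ell\} \cup \{J_n = n\}) \le \frac{\ell + 1}{n} \to 0$$

for all $\ell \in \mathbb{N}$ and $n \to \infty$. □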

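Proof of Theorem 2. According to Theorem 1, equation (14) has
a unique solution, so we can consider $(S,C)$ and $(S',C')$, two independent
copies with $\mathcal{L}(S,C) = \mathcal{L}(S',C')$ being the fixed point of (14) and $U$ being
uniform over $(0,1)$. By definition, we have

$$\begin{pmatrix} S \\ C \end{pmatrix} \stackrel{d}{=} U \begin{pmatrix} S \\ C \end{pmatrix} + (1-U) \begin{pmatrix} S' \\ C' \end{pmatrix} + \begin{pmatrix} 2U\log U + 2(1-U)\log(1-U) + 1 \\ U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U) \end{pmatrix}.$$

Using the fact that all random variables (except $U$) are centered, we find
that

$$\mathbb{E}[C^2] = \mathbb{E}[U^2 C^2] + \mathbb{E}[(1-U)^2 C'^2] + \mathbb{E}[(U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U))^2].$$

Thus, we have

$$\mathrm{Var}[C] = \mathbb{E}[C^2] = 3 - \frac{\pi^2}{6} - \log 2.$$

In the same way, we find that

$$\mathbb{E}[SC] = \mathbb{E}[U^2 SC] + \mathbb{E}[(1-U)^2 S'C'] + \mathbb{E}[(2U\log U + 2(1-U)\log(1-U) + 1)(U\log U + (1-U)\log(1-U) + 1 - 2\min(U, 1-U))].$$

This leads to

$$\mathrm{Cov}(S, C) = \mathbb{E}[SC] = \frac{9}{2} - \frac{\pi^2}{3} - \log 2.$$

Using the fact that $\mathbb{E}[S^2] = 7 - 2\pi^2/3$ [32], we find that

$$\mathrm{Cor}(S, C) = \frac{27 - 2\pi^2 - 6\log 2}{\sqrt{2(18 - \pi^2 - 6\log 2)(21 - 2\pi^2)}}.$$

Theorem 1 holds in the Wasserstein-Mallows $d_2$-metric, which implies the
convergence of second moments. This leads to

$$\mathrm{Var}[S_n] \sim \mathrm{Var}[S]\, n^2, \qquad \mathrm{Var}[C_n] \sim \mathrm{Var}[C]\, n^2$$

and

$$\mathrm{Cov}(S_n, C_n) \sim \mathrm{Cov}(S, C)\, n^2, \qquad \mathrm{Cor}(S_n, C_n) \sim \mathrm{Cor}(S, C). \qquad □$$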

Remark 7. Lemma 2 suggests that a more general class of statistics
could be studied using the same technique. When the toll function
$t_n = h(I_n, J_n)$ varies, general limit laws for recursive random variables can be
found in [17]. Note that we have started the proof with the guess that the
variance was of order $n^2$ (which may not be obvious in general). Readers
interested in multivariate distributional recursion and convergence to a
functional fixed-point solution could refer to the recent survey by Rüschendorf
and Neininger [33].
5. Uniform model. For a given $n$, the uniform model assumes that all
PTs with $n$ leaves are equally likely. Again, we use the fact that Sackin's
and Colless' indices for a PT with $n$ leaves drawn according to the uniform
model have the same probability distribution as $\tilde S_{n-1} + 2(n-1)$ and $\tilde C_{n-1}$,
which are defined by equations (8) and (9), respectively. Under the Catalan
model for BSTs, $I_n$ is distributed according to $\tilde q_n$, where

$$\tilde q_n(i) = \frac{C_i\, C_{n-1-i}}{C_n}, \qquad i = 0, \dots, n-1.$$

Conditional on $I_n$, $(\tilde S_{I_n}, \tilde C_{I_n})$ is independent of $(\tilde S'_{J_n}, \tilde C'_{J_n})$.

Clearly, $\tilde S_n$ has the same distribution as the internal path length of a BST
under the Catalan model and $\tilde C_n$ has the same distribution as the random
variable $\sum_j |L_j - R_j|$, where the sum is over the $n - 1$ vertices of a BST
drawn under the Catalan model. Note that $\tilde C_n$ can be rewritten as

$$\sum_{j=1}^{n-1} (N_j - 1) - 2\min(L_j, R_j),$$

where $N_j$ is the number of vertices of the subtree rooted at $j$ (including $j$)
and $L_j$ ($R_j$) is the number of vertices of the left (right) subtree. Then we
have

$$\tilde C_n = \tilde S_n - 2 \sum_{j=1}^{n-1} \min(L_j, R_j).$$

It is well known [37] that $\tilde S_n/n^{3/2}$ converges in distribution to the Airy
distribution. The proof relies on the one-to-one correspondence between
binary trees and Bernoulli excursions. Takács [37] computed the moments of
$\tilde S_n/n^{3/2}$ and their limiting values to establish convergence using the method
of moments. The goal of this section is to prove the convergence

$$\mathbb{E}\left[ \frac{\sum_j \min(L_j, R_j)}{n^{3/2}} \right] \to 0, \qquad n \to \infty.$$

This implies that $(\tilde C_n - \tilde S_n)/n^{3/2}$ converges in probability to 0 and completes
the proof of Theorem 3.



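Lemma 3. Let $n \ge 2$ and consider a BST with $n$ vertices under the
Catalan model. Denote by $j_0$ the root of the tree. We have

$$\mathbb{E}[\min(L_{j_0}, R_{j_0})] \le K\sqrt{n},$$

for some constant $K$.

Proof. Stirling's formula yields a well-known asymptotic expansion for
the Catalan number $C_n$:

$$C_n = \frac{4^n}{\sqrt{\pi}\, n^{3/2}} \left( 1 + O\left(\frac{1}{n}\right) \right). \tag{15}$$

The expectation of $\min(L_{j_0}, R_{j_0})$ is given by

$$\mathbb{E}[\min(L_{j_0}, R_{j_0})] = \sum_{k=1}^{\lfloor n/2 \rfloor - 1} 2k\, \frac{C_k\, C_{n-1-k}}{C_n} + \mathbf{1}_{\{n \in 2\mathbb{N}+1\}}\, \frac{n-1}{2}\, \frac{C_{(n-1)/2}^2}{C_n}.$$

Using (15), we find that

$$\sum_{k=1}^{\lfloor n/2 \rfloor - 1} 2k\, \frac{C_k\, C_{n-1-k}}{C_n} = \frac{4^{n-1}}{\pi\, C_n\, (n-1)} \cdot \frac{1}{n-1} \sum_{k=1}^{\lfloor n/2 \rfloor - 1} \frac{2\,(1 + O(1/k) + O(1/n))}{\sqrt{k/(n-1)}\, (1 - k/(n-1))^{3/2}}.$$

The sum in the right-hand side of the above equation is a Riemann sum.
Using the fact that

$$\int_0^{1/2} \frac{dx}{\sqrt{x}\, (1-x)^{3/2}} = 2,$$

we have

$$\sum_{k=1}^{\lfloor n/2 \rfloor - 1} 2k\, \frac{C_k\, C_{n-1-k}}{C_n} = \frac{4^n}{\pi\, C_n\, (n-1)}\, (1 + o(1)).$$

Using (15) again leads to

$$\mathbb{E}[\min(L_{j_0}, R_{j_0})] \sim \sqrt{n/\pi}, \tag{16}$$

which concludes the proof of the result. □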
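The estimate of Lemma 3 can be illustrated with an exact computation. In a BST with $n$ vertices under the Catalan model, the left subtree of the root has $k$ vertices with probability $C_k C_{n-1-k}/C_n$, so the expectation of $\min(L_{j_0}, R_{j_0})$ is a finite sum of integers divided by $C_n$. The sketch below evaluates this sum with Python's arbitrary-precision integers and compares it with $\sqrt{n/\pi}$; the function names are illustrative.

```python
from math import pi, sqrt


def catalan_numbers(n_max):
    """List [C_0, ..., C_{n_max}] via C_{m+1} = C_m * 2(2m+1) / (m+2)."""
    numbers = [1]
    for m in range(n_max):
        numbers.append(numbers[m] * 2 * (2 * m + 1) // (m + 2))
    return numbers


def expected_root_min(n):
    """Exact E[min(L, R)] at the root of a Catalan BST with n vertices."""
    c = catalan_numbers(n)
    numerator = sum(min(k, n - 1 - k) * c[k] * c[n - 1 - k] for k in range(n))
    return numerator / c[n]      # big-integer ratio converted to a float


for n in (10, 100, 1000):
    print(n, expected_root_min(n), sqrt(n / pi))
```

The exact values approach $\sqrt{n/\pi}$ from below, and the relative gap shrinks as $n$ grows, which is consistent with the asymptotic equivalence (16).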

By conditioning on the sizes of the two subtrees of the root and using
induction on $n$, it follows that

$$\mathbb{E}\left[ \frac{\sum_j \min(L_j, R_j)}{n^{3/2}} \right] \le K\, \mathbb{E}\left[ \frac{\sum_j \sqrt{N_j}}{n^{3/2}} \right].$$

In the following, we prove that the right-hand side of the above inequality
converges to 0 as $n$ goes to $\infty$.

A lemma which is interesting in its own right provides the key argument.
In the following, we use the standard convention that $\binom{0}{0} = 1$.
Lemma 4. Let $n \ge 1$. Consider a BST with $n$ vertices sampled
according to the Catalan model. Pick a vertex $V$ at random from the $n$ vertices.
Denoting $K_n = N_V$, we have

$$\mathbb{P}(K_n = k) = \frac{C_k \binom{2n-2k}{n-k}}{n\, C_n}, \qquad k = 1, \dots, n.$$

Proof. The proof relies on combinatorial arguments. Let us denote by
$\nu_k(T)$ the number of subtrees with $k$ vertices in the BST $T$ having $|T| = n$
vertices. For $k = 1, \dots, n$, we have

$$\mathbb{P}(K_n = k) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{P}(N_j = k) = \frac{1}{n}\, \mathbb{E}\left[ \sum_{j=1}^{n} \mathbf{1}_{\{N_j = k\}} \right] = \frac{\mathbb{E}[\nu_k(T)]}{n}.$$

$\nu_k(T)$ satisfies the linear recursion

$$\nu_k(T) = \delta_{|T|,k} + \sum_{S} \nu_k(S), \tag{17}$$

where $\delta$ denotes the Kronecker symbol and the sum is over the subtrees of
the root of $T$. Let us denote by $\mathcal{B}$ the set of all BSTs. We now introduce the
cumulative generating functions defined by

$$F_k(z) = \sum_{T \in \mathcal{B}} \nu_k(T)\, z^{|T|}$$

and

$$G_k(z) = \sum_{T \in \mathcal{B}} \delta_{|T|,k}\, z^{|T|} = C_k z^k.$$

From the linear recurrence equation (17), Theorem 5.7 in [35] establishes
the following relationship between the cumulative generating functions $F_k$ and
$G_k$:

$$F_k(z) = \frac{G_k(z)}{\sqrt{1 - 4z}}.$$

Using the fact that

$$\frac{1}{\sqrt{1 - 4z}} = \sum_{i \ge 0} \binom{2i}{i} z^i,$$

we find that

$$F_k(z) = \sum_{i \ge k} \binom{2i - 2k}{i - k}\, C_k\, z^i.$$

Since the expectation of $\nu_k(T)$ is given by

$$\mathbb{E}[\nu_k(T)] = [z^n] F_k(z) / C_n,$$

this completes the proof of the lemma. □
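Lemma 4 can be checked for small trees without generating functions. The sketch below counts, over all binary trees with $n$ vertices, the total number of vertices whose subtree has exactly $k$ vertices, by decomposing each tree at its root in the spirit of recursion (17), and compares the result with $C_k \binom{2n-2k}{n-k}$. The function names are illustrative.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def catalan(m):
    """m-th Catalan number."""
    return comb(2 * m, m) // (m + 1)


@lru_cache(maxsize=None)
def subtree_total(n, k):
    """Sum over all binary trees with n vertices of nu_k(T), for k >= 1."""
    if n < k:
        return 0
    total = catalan(n) if n == k else 0   # the root itself counts when n == k
    for i in range(n):                    # i vertices on the left, j on the right
        j = n - 1 - i
        total += subtree_total(i, k) * catalan(j) + catalan(i) * subtree_total(j, k)
    return total


def lemma4_probability(n, k):
    """P(K_n = k) as stated in Lemma 4."""
    return Fraction(catalan(k) * comb(2 * n - 2 * k, n - k), n * catalan(n))


n = 6
print([str(lemma4_probability(n, k)) for k in range(1, n + 1)])
```

Dividing `subtree_total(n, k)` by $n\,C_n$ gives the probability that a uniformly chosen vertex of a uniformly chosen tree has a subtree of size $k$, which coincides with the closed form for every pair tested.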

Corollary 1. Let $n \ge 2$. Consider a PT with $n$ leaves sampled from
the uniform model. Let $K_n$ denote the number of leaves descending from a
uniformly chosen random ancestor. We then have

$$\mathbb{P}(K_n = k) = \frac{C_{k-1} \binom{2n-2k}{n-k}}{(n-1)\, C_{n-1}} \qquad \text{for } k = 2, \dots, n.$$

Proof. This is a direct consequence of Lemma 2. □



Remark 8. As $n$ goes to infinity, the distribution of $K_n$ converges to

$$\mathbb{P}(K = k) = 4^{-k} C_k \sim \frac{1}{\sqrt{\pi}\, k^{3/2}}, \qquad \text{for large } k.$$

The tail of the distribution of $K$ has a power law with parameter 3/2. This
can be compared to a similar result in the context of BSTs [23] under the
random permutation model. In this case, $K_n$ has a power law distribution
with parameter 2. Since 3/2 is less than 2, large random subtrees are more
likely in the Catalan model than in the random permutation model. This was an
expected result since Catalan binary trees are known to be more unbalanced
than BSTs under the random permutation model (Section 5.6, [35]).

Remark 9. Actually, the limiting distribution in the preceding
comment is equal to the size of the critical Galton-Watson process with a
binomial Bi(2, 1/2) offspring distribution (Lemma 9, [3]). This is the
Galton-Watson process corresponding to binary trees.



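We are now ready to prove that $\mathbb{E}[\sum_j \sqrt{N_j}]/n^{3/2}$ converges to 0 as $n$ goes
to $\infty$. Let $\alpha \in\, ]0,1[$ and split the sum $\mathbb{E}[\sum_j \sqrt{N_j}]$ into two parts:

$$\mathbb{E}\left[ \sum_j \sqrt{N_j} \right] = \mathbb{E}\left[ \sum_{j,\, N_j < n^\alpha} \sqrt{N_j} \right] + \mathbb{E}\left[ \sum_{j,\, N_j \ge n^\alpha} \sqrt{N_j} \right].$$

Obviously, we have

$$\frac{1}{n^{3/2}}\, \mathbb{E}\left[ \sum_{j,\, N_j < n^\alpha} \sqrt{N_j} \right] \le \frac{1}{n^{3/2}}\, \mathbb{E}\left[ \sum_{j,\, N_j < n^\alpha} n^{\alpha/2} \right] \le n^{\alpha/2 - 1/2} \to 0$$

when $n$ goes to $\infty$. For the second term, we have

$$\frac{1}{n^{3/2}}\, \mathbb{E}\left[ \sum_{j,\, N_j \ge n^\alpha} \sqrt{N_j} \right] \le \frac{1}{n}\, \mathbb{E}\left[ \sum_{j,\, N_j \ge n^\alpha} 1 \right].$$

The right-hand side of the inequality is equal to $\mathbb{P}(K_n \ge n^\alpha)$. Applying
Lemma 4, we find that

$$\frac{1}{n}\, \mathbb{E}\left[ \sum_{j,\, N_j \ge n^\alpha} 1 \right] \sim \kappa\, n^{-\alpha/2}$$

for some constant $\kappa$. This expression converges to 0 when $n$ goes to $\infty$. This
completes the proof of Theorem 3.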

Remark 10. Fill and Kapur [12] established more precise results
concerning $\mathbb{E}[\sum_j \sqrt{N_j}]$. They proved that

$$\mathbb{E}\left[ \sum_j \sqrt{N_j} \right] \sim \frac{1}{\sqrt{\pi}}\, n\log n. \tag{18}$$

Their results rely on Hadamard products. In a recent preprint, Ford [14]
gave an alternate proof that $(C_n - S_n)/n^{3/2}$ converges in probability to 0.
Note that our proof uses elementary arguments and is instructive in its own
right as regards the shapes of Catalan trees. Besides, equation (18) follows
easily from Lemma 4, by direct estimation of the sum by an integral.



Proof of Theorem 4. In the following, we prove the convergence of
the mixed moments of $S_n$ and $C_n$. For $k, \ell \ge 0$, we have

$$\mathbb{E}[S_n^k C_n^\ell] \sim n^{3(k+\ell)/2}\, \mathbb{E}[A^{k+\ell}]. \tag{19}$$

The argument is similar to the argument given by Janson (Remark 3.5,
[18]) to establish the convergence of the mixed moments of the internal
path length and the Wiener index. Since the convergence in distribution has
been established in Theorem 3, the above equation is equivalent to uniform
integrability of $n^{-3(k+\ell)/2} S_n^k C_n^\ell$ for $n \ge 1$ and any fixed $k, \ell$. Since $C_n \le S_n$,
the result follows from the fact that $n^{-3k/2} (S_n)^k$ is uniformly integrable for
every fixed $k$. This is true because Takács [37] proved the convergence of the
moments of $S_n$. □